Recognition of HTML Table Structure
Abstract
Tables in HTML Web pages have become valuable sources of knowledge, so it is reasonable and necessary to develop an algorithm that extracts knowledge from them. Doing so requires a system that identifies the boundary between the attributes and the values of an HTML table and transforms the table into more understandable attribute-value pairs. In this paper, we propose an algorithm for this purpose. Its central idea is that a row (or column) with low similarity to the other rows (or columns) is probably an attribute-name row (or column), while the remaining rows (or columns) hold value data. On 300 HTML tables downloaded from the Web, the algorithm achieves 82% accuracy for lengthways recognition and 78% accuracy for sideways recognition.
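The low-similarity idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how similarity is measured, so the coarse cell-type comparison in `cell_kind`, the 0.5 threshold, and all function names are assumptions made only for illustration.

```python
from statistics import mean


def cell_kind(cell: str) -> str:
    """Coarse cell type; an assumed stand-in for the paper's similarity features."""
    text = cell.strip()
    if not text:
        return "empty"
    digits = text.replace(",", "").replace(".", "", 1).lstrip("+-")
    return "number" if digits.isdigit() else "text"


def row_similarity(a: list[str], b: list[str]) -> float:
    """Fraction of aligned cells whose coarse types match."""
    pairs = list(zip(a, b))
    if not pairs:
        return 0.0
    return sum(cell_kind(x) == cell_kind(y) for x, y in pairs) / len(pairs)


def mean_similarity_to_others(rows: list[list[str]]) -> list[float]:
    """Average similarity of each row to every other row."""
    scores = []
    for i, row in enumerate(rows):
        others = [row_similarity(row, o) for j, o in enumerate(rows) if j != i]
        scores.append(mean(others) if others else 1.0)
    return scores


def likely_attribute_rows(rows: list[list[str]], threshold: float = 0.5) -> list[int]:
    """Indices of rows that look unlike the rest (assumed threshold)."""
    scores = mean_similarity_to_others(rows)
    return [i for i, score in enumerate(scores) if score < threshold]


def transpose(rows: list[list[str]]) -> list[list[str]]:
    """Turn columns into rows so the same test applies to columns."""
    return [list(col) for col in zip(*rows)]


# Attribute names in the first row.
by_row = [
    ["Year", "Sales", "Profit"],
    ["2001", "120", "15"],
    ["2002", "135", "18"],
    ["2003", "150", "22"],
]
# Attribute names in the first column.
by_column = transpose(by_row)

print(likely_attribute_rows(by_row))                # [0]
print(likely_attribute_rows(transpose(by_column)))  # [0]
```

Transposing the table lets one routine cover both the row case and the column case. A table whose attribute names and values share the same cell type would defeat this particular similarity measure, which is why it should be read as a sketch of the idea rather than a usable recognizer.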
Similar resources
Mining Tables from Large Scale HTML Texts
Table is a very common presentation scheme, but few papers touch on table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. Table filtering, recognition, interpretation, and presentation are discussed. Heuristic rules and cell similarities are employed to identify tables. The F-measure of table recognition is 86.50%. We also propose an algorithm to...
Layout and Language: Challenges for Table Understanding on the Web
In this paper, we consider the table understanding task and present a catalogue of particular issues that arise when the tables are those found on the web. In addition, we consider what happens when processes commonly associated with web pages are applied to those bearing tables. 1 Table Understanding and the Web The ubiquity of tables, and their ability to describe relational information in a ...
Notes on Contemporary Table Recognition
The shift of interest to web tables in HTML and PDF files, coupled with the incorporation of table analysis and conversion routines in commercial desktop document processing software, are likely to turn table recognition into more of a systems than an algorithmic issue. We illustrate the transition by some actual examples of web table conversion. We then suggest that the appropriate target form...
Automating the extraction of data from HTML tables with unknown structure
Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of inter...
Automatically Extracting Ontologically Specified Data from HTML Tables of Unknown Structure
Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. The solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to recognize attributes...